17. Quiz: Difficulties in A/B Testing

Being able to determine the statistical significance of performance differences in A/B test results is valuable. However, there are many other factors to consider to ensure your A/B tests are successful. In the real world, designing, running, and drawing conclusions from an A/B test that leads you to the right decision can be tricky.

In the following quizzes, you'll find three scenarios where Audacity conducted an A/B test that led to a poor decision. Think about what went wrong and what could've been done to avoid these outcomes.


Scenario #1

  • EXPERIMENT: Audacity tests a new layout in the classroom to see if it helps engage students. After running an A/B test for two weeks, they find that average classroom times and completion rates decrease with the new layout, and decide against launching the change.
  • REALITY: What they don't know is that classroom times and completion rates actually increase significantly for new students using the new layout. In the long run, the layout would help existing students too, but they are currently experiencing change aversion.

Based on the information above, which of the following contributed to this misguided decision? Select all that apply.

SOLUTION:
  • The experiment included existing users who would bias results in a short time frame (see the segmentation sketch below).
  • The experiment wasn't run long enough to allow existing users to adjust to the change.
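
One way Audacity could have caught this is to segment the test results by user cohort before deciding. The sketch below assumes a hypothetical per-student log (ab_data.csv with group, user_type, and completed columns); the file and column names are illustrative, not part of the scenario.

```python
import pandas as pd

# Hypothetical experiment log: one row per student, with the A/B group,
# whether the student is new or existing, and whether they completed the course.
df = pd.read_csv('ab_data.csv')  # columns: group, user_type, completed

# The aggregated view reproduces the mistaken conclusion:
# completion rate looks worse for the new layout overall.
print(df.groupby('group')['completed'].mean())

# Segmenting by user type separates new students (who benefit from the new
# layout) from existing students who are still experiencing change aversion.
print(df.groupby(['group', 'user_type'])['completed'].mean())
```

Even a simple breakdown like this would have shown that the overall decline was driven by existing users reacting to the change, not by the layout itself.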

Scenario #2

  • EXPERIMENT: Audacity tests a new feature on their landing page that guides users through their course selection process and makes recommendations. After running an A/B test for a month, they find that click-through rates on course pages, and with them enrollment rates, increase with the new feature, and decide to launch the change.
  • REALITY: What they don't know is that although the total number of enrollments increases with the new feature, students are almost exclusively purchasing shorter, cheaper courses, which brings down revenue for Audacity. It turns out the feature is leading students to choose courses with smaller commitments.

Based on the information above, which of the following contributed to this misguided decision? Select all that apply.

SOLUTION:
  • Enrollment rate alone was not the best metric for this experiment.
  • Their metric(s) didn't account for revenue, which is ultimately what they want to increase with their decision (see the revenue-per-visitor sketch below).
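
A metric tied more directly to revenue would have told a different story here. The sketch below compares enrollment rate and revenue per visitor side by side; the landing_page_data.csv file and its group, enrolled, and revenue columns are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical per-visitor log: A/B group, whether the visitor enrolled,
# and the revenue generated by that enrollment (0 if they didn't enroll).
df = pd.read_csv('landing_page_data.csv')  # columns: group, enrolled, revenue

summary = df.groupby('group').agg(
    enrollment_rate=('enrolled', 'mean'),
    revenue_per_visitor=('revenue', 'mean'),
)
print(summary)
# Enrollment rate can rise for the experiment group while revenue per
# visitor stays flat or falls -- the metric Audacity ultimately cares about.
```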

Scenario #3

  • EXPERIMENT: Audacity tests a new description for a difficult course that gets very few enrollments. They hope this description is more exciting and motivates students to take it. After running an A/B test for five weeks, they find that the enrollment rate increases with the new description, and decide to launch the change.
  • REALITY: What they don't know is that although the enrollment rate appears to increase with the new description, the results from this A/B test are unreliable and largely due to chance: fewer than 40 of the thousands of visitors enrolled during the experiment, so even one additional student can substantially sway the results and potentially even the conclusion.

Based on the information above, which of the following contributed to this misguided decision? Select all that apply.

SOLUTION:
  • This course page had too little traffic and too few conversions to produce significant, repeatable results in this time frame (the simulation sketch below illustrates why).
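
To see why so few conversions make the result unreliable, a quick simulation under the null hypothesis helps. The visitor counts and rates below are assumptions chosen only to match the scale described in the scenario, not actual experiment data.

```python
import numpy as np

np.random.seed(42)

# Assumed scale: thousands of visitors per group, roughly 1% enrollment rate.
visitors_per_group = 2000
control_rate = 0.010   # ~20 enrollments in the control group
observed_lift = 0.002  # experiment group appears ~0.2 percentage points higher

# A single extra enrollment shifts a group's rate by 1/2000 = 0.0005,
# i.e. a quarter of the entire observed lift.
print(1 / visitors_per_group)

# Simulate both groups under the null hypothesis that the description changes
# nothing, and count how often chance alone produces a lift at least this large.
null_diffs = (np.random.binomial(visitors_per_group, control_rate, 10000) -
              np.random.binomial(visitors_per_group, control_rate, 10000)) / visitors_per_group
print((null_diffs >= observed_lift).mean())  # a large p-value means the lift is unreliable
```

With numbers on this scale, the simulated p-value comes out well above common significance thresholds, so an observed lift of this size is well within ordinary chance variation.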

Difficulties in A/B Testing

As you saw in the scenarios above, there are many factors to consider when designing an A/B test and drawing conclusions based on its results. To conclude, here are some common ones to consider.

  • Novelty effect and change aversion when existing users first experience a change
  • Sufficient traffic and conversions to have significant and repeatable results (a rough power-calculation sketch follows this list)
  • Best metric choice for making the ultimate decision (e.g., measuring revenue vs. clicks)
  • Long enough run time for the experiment to account for changes in behavior based on time of day/week or seasonal events
  • Practical significance of a conversion rate (the cost of launching a new feature vs. the gain from the increase in conversion)
  • Consistency among test subjects in the control and experiment groups (imbalance in the populations represented in each group can lead to situations like Simpson's Paradox)
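
For the traffic-and-conversions point in particular, a rough power calculation before launching a test tells you approximately how many visitors per group you need to detect a lift you would actually act on. This sketch uses statsmodels; the baseline and target rates are assumptions for illustration only.

```python
# Rough sample-size check before running a test on a low-traffic page.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.010  # assumed current enrollment rate
target_rate = 0.015    # smallest lift worth acting on (practical significance)

effect_size = proportion_effectsize(target_rate, baseline_rate)
visitors_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative='two-sided'
)
print(round(visitors_per_group))  # visitors needed in EACH group
```

If the page cannot realistically reach that many visitors in a reasonable run time, the experiment is unlikely to produce significant and repeatable results, as in Scenario #3.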